Research Project Report: Spark, BlinkDB and Sampling

نویسنده

Qiming Shao

چکیده

During the 2015-2016 academic year, I conducted research about Spark, BlinkDB and various sampling techniques. This research helped the team have a better understanding of the capabilities and properties of Spark, BlinkDB system and different sampling technologies. Additionally, I benchmarked and implemented various Machine learning and sampling methods on the top of both Spark and BlinkDB. There are two parts from my work: Part I (Fall 2015) Research on Spark and Part II (Spring 2016) Probability and Sampling Techniques and Systems

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Blink and It's Done: Interactive Queries on Very Large Data

In this demonstration, we present BlinkDB, a massively parallel, sampling-based approximate query processing framework for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can make reasonable decisions in the absence of perfect answers. BlinkDB extends the Hive/HDFS stack and can handle the same set of SPJA (selection, projection, join and aggrega...

متن کامل

Review of Diesel Particulate Matter Sampling Methods Final Report

.........................................................................................................................3 INTRODUCTION ................................................................................................................5 DIESEL ENGINE TECHNOLOGY AND EMISSION REGULATIONS .............................7 PHYSICAL AND CHEMICAL NATURE OF DIESEL AEROSOL ......................

متن کامل

Tokeneer: Beyond Formal Program Verification

Tokeneer is a small-sized (10 kloc) security system which was formally developed and verified by Praxis at the request of NSA, using SPARK technology. Since its open-source release in 2008, only two problems were found, one by static analysis, one by code review. In this paper, we report on experiments where we systematically applied various static analysis tools (compiler, bug-finder, proof to...

متن کامل

Approximate Stream Analytics in Apache Flink and Apache Spark Streaming

Approximate computing aims for efficient execution of workflows where an approximate output is sufficient instead of the exact output. The idea behind approximate computing is to compute over a representative sample instead of the entire input dataset. Thus, approximate computing — based on the chosen sample size — can make a systematic trade-off between the output accuracy and computation effi...

متن کامل

Parallel Maritime Traffic Clustering Based on Apache Spark

Maritime traffic patterns extraction is an essential part for maritime security and surveillance and DBSCANSD is a density based clustering algorithm extracting the arbitrary shapes of the normal lanes from AIS data. This paper presents a parallel DBSCANSD algorithm on top of Apache Spark. The project is an experimental research work and the results shown in this paper is preliminary. The exper...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2016

Research Project Report: Spark, BlinkDB and Sampling

نویسنده

چکیده

منابع مشابه

Blink and It's Done: Interactive Queries on Very Large Data

Review of Diesel Particulate Matter Sampling Methods Final Report

Tokeneer: Beyond Formal Program Verification

Approximate Stream Analytics in Apache Flink and Apache Spark Streaming

Parallel Maritime Traffic Clustering Based on Apache Spark

عنوان ژورنال:

اشتراک گذاری